I divided the programing part into following steps:
- Random selection of the centers
- Computing the norm-2 squared distance for rach points from the centeroids
- Assigning each points to neasrest centeroids
- Update the centorids with in the clusters
- Calculating the objective function
- Chekcing whether the convergence is occured
- Combining the above steps to establish the k-means algorithm
- Adjusting the K-means combined code, so it can easily digest the pictures and produce compressed images
- Running the algorithm on the pictures, with desired K values
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.core.display import display, HTML
To start the algorithm, Let's randomly choose k observations from the data:
def init_centers(X, k):
"""
Randomly samples k observations from X as centers.
Returns these centers as a (k x d) numpy array.
"""
from numpy.random import choice
samples = choice(len(X), size=k, replace=False)
return X[samples, :]
Computing the distances
Implement a function that computes a distance matrix, S=(sij)
It should return a Numpy matrix S[:m, :k].
#####this part of the code was later modified later as it was taking huge time
def compute_d(X, centers):
m = len(X)
k = len(centers)
S = np.empty((m, k))
for i in range(m):
for j in range(k):
d_ij = dist_squared = np.sum(np.square(X[i] - centers[j]))
S[i, j] = d_ij
return S
#####this part of the code was later modified later as it was taking huge time
def compute_d3(X, centers):
m = len(X)
k = len(centers)
S = np.empty((m, k))
for i in range(m):
for j in range(k):
d_i = np.linalg.norm(X[i, :] - centers, ord=2, axis=1)
S[i, :] = d_i**2
return S
this is the time optimized code for distance calculation, thanks to this web address: https://medium.com/swlh/euclidean-distance-matrix-4c3e1378d87f
#####this is the time optimized code for distance calculation, thanks to this web address:
#https://medium.com/swlh/euclidean-distance-matrix-4c3e1378d87f
def compute_d2(X, centers):
#S= np.dot(X.T, X) - 2* np.dot(X.T, centers) + np.dot(centers.T, centers)
#S= np.dot(X, X.T) - 2* np.dot(centers,X.T) + np.dot(centers,centers.T)
p1 = np.sum(X**2,axis=1)[:,np.newaxis]
p2 = np.sum(centers**2, axis=1)
p3 = -2 * np.dot(X,centers.T)
S = p1+p2+p3
return S
Now that we have our centeroids and also calculated the distances from our sample points to each cluster, we can assigan each point to it's corresponding cluster, based on the minimum calculated distance
def assign_cluster_labels(S):
return np.argmin(S, axis=1)
the centeroids need to be updated as per the updated labels
def update_centers(X, y):
m, d = X.shape
k = max(y) + 1
assert m == len(y)
assert (min(y) >= 0)
centers = np.empty((k, d))
for j in range(k):
# Compute the new center of cluster j,
centers[j, :d] = np.mean(X[y == j, :], axis=0)
return centers
Given the squared distances, return the within-cluster sum of squares:
def WCSS(S):
return np.sum(np.amin(S, axis=1))
Lastly, here is a function to check whether the centers have moved, given two instances of the center values. It accounts for the fact that the order of centers may have changed.
def has_converged(old_centers, centers):
return set([tuple(x) for x in old_centers]) == set([tuple(x) for x in centers])
Put all of the preceding building blocks together to implement k -means algorithm:
def kmeans(X, k,max_steps=np.inf):
import math
centers = init_centers(X, k)
converged = False
labels = np.zeros(len(X))
i = 1
while (not converged) and (i <= max_steps):
old_centers = centers
S = compute_d2(X, centers)
labels = assign_cluster_labels(S)
centers = update_centers(X, labels)
converged = has_converged(old_centers, centers)
if math.isnan(WCSS (S)):
print ("####empty cluster, you need to lower the number of clusters,k")
return labels + 0.5
else:
i += 1
print("#####")
print ("number of required iterations to converge=", i)
return labels
Read image and store it as a 3 dimensional image array
from PIL import Image
from matplotlib.pyplot import imshow
%matplotlib inline
def read_img(path):
"""
Read image and store it as an array, given the image path.
Returns the 3 dimensional image array.
"""
img = Image.open(path)
img_arr = np.array(img, dtype='int32')
img.close()
return img_arr
def display_image(arr):
"""
display the image
input : 3 dimensional array
"""
arr = arr.astype(dtype='uint8')
img = Image.fromarray(arr, 'RGB')
imshow(np.asarray(img))
the cluster_image function does the followings:
def cluster_image(image,centers_Number):
r, c, l = image.shape
img_reshaped = np.reshape(image, (r*c, l), order="C")
labels = kmeans(img_reshaped, centers_Number)
ind = np.column_stack((img_reshaped, labels))
centers_dic = {}
for i in set(labels):
c = ind[ind[:,3] == i].mean(axis=0)
centers_dic[i] = c[:3]
img_clustered = np.array([centers_dic[i] for i in labels])
return img_clustered
convert the above clustered image to a 3D Matrix, similar to the original image, for furthure visulaization
def compress_image (cluster_image, image):
r, c, l = image.shape
return np.reshape(cluster_image, (r, c, l), order="C")
the following function combines the above written functions into one and monitor the timing of the process. Also if any cluster is empty, with no points associate, it warns the user to narrow down the k, cluster numbers.
def KMeans_on_image (input_image, centers_Number):
import time
from datetime import timedelta
img_arr = read_img(input_image)
start = time.time()
img_clusterd = cluster_image (img_arr,centers_Number)
end = time.time()
Delta = end - start
Elapsed=str(timedelta(seconds=Delta))
print("Time taken for applying k-means algorithm on {} with K={} is: \n {}".format(input_image,centers_Number, Elapsed))
img_compressed = compress_image (img_clusterd, img_arr)
return img_compressed
####for k-means running on list of K=[2,4,8,16], 'football' iamge
Compressed_Image={}
for k in [2,4,8,16]:
Compressed_Image[k]=KMeans_on_image ("football.bmp", k)
#plotting the original and compressed images, 'football'
import matplotlib.pyplot as plt
intro_fig, ax = plt.subplots(1,1, figsize=(8,4))
ax.imshow(plt.imread('football.bmp'))
ax.set_title('Original\n', fontsize=20, weight='bold')
f, axarr = plt.subplots(2,2,figsize=(17.5,8.75))
axarr[0,0].imshow(Compressed_Image[2]/255)
axarr[0,0].set_title('K=2', fontsize=20, weight='bold')
axarr[0,1].imshow(Compressed_Image[4]/255)
axarr[0,1].set_title('K=4', fontsize=20, weight='bold')
axarr[1,0].imshow(Compressed_Image[8]/255)
axarr[1,0].set_title('K=8', fontsize=20, weight='bold')
axarr[1,1].imshow(Compressed_Image[16]/255)
axarr[1,1].set_title('K=16', fontsize=20, weight='bold')
####for k-means running on list of K=[2,4,8,16], 'beach' iamge
Compressed_Image_B={}
for k in [2,4,8,16]:
Compressed_Image_B[k]=KMeans_on_image ("beach.bmp", k)
#plotting the original and compressed images, 'beach'
#import matplotlib.pyplot as plt
intro_fig, ax = plt.subplots(1,1, figsize=(8,4))
ax.imshow(plt.imread('beach.bmp'))
ax.set_title('Original\n', fontsize=20, weight='bold')
f, axarr = plt.subplots(2,2,figsize=(17.5,8.75))
axarr[0,0].imshow(Compressed_Image_B[2]/255)
axarr[0,0].set_title('K=2', fontsize=20, weight='bold')
axarr[0,1].imshow(Compressed_Image_B[4]/255)
axarr[0,1].set_title('K=4', fontsize=20, weight='bold')
axarr[1,0].imshow(Compressed_Image_B[8]/255)
axarr[1,0].set_title('K=8', fontsize=20, weight='bold')
axarr[1,1].imshow(Compressed_Image_B[16]/255)
axarr[1,1].set_title('K=16', fontsize=20, weight='bold')
Due to nature of this image, alots white constant pixel, the k-means clustering faced empty clusters for number of k-values, mostly for the higher K values. The code however didnot halt and also reported the empty clusters accordingly.
####for k-means running on list of K=[2,4,8,16], 'my_PIC.JPG' iamge
Compressed_Image_M={}
for k in [2,4,8,16]:
Compressed_Image_M[k]=KMeans_on_image ("my_PIC.JPG", k)
For the above run images associated with K=8 & k=16 had empty clusters, but depending on the initial centorids that varies for different K numbers. Somtimes even for K=2 I faced empty clusters, and even sometime I had an ok image for k=8.
#plotting the original and compressed images, 'my_PIC.JPG'
#import matplotlib.pyplot as plt
intro_fig, ax = plt.subplots(1,1, figsize=(8,4))
ax.imshow(plt.imread('my_PIC.JPG'))
ax.set_title('Original\n', fontsize=20, weight='bold')
f, axarr = plt.subplots(2,2,figsize=(17.5,8.75))
axarr[0,0].imshow(Compressed_Image_M[2]/255)
axarr[0,0].set_title('K=2', fontsize=20, weight='bold')
axarr[0,1].imshow(Compressed_Image_M[4]/255)
axarr[0,1].set_title('K=4', fontsize=20, weight='bold')
axarr[1,0].imshow(Compressed_Image_M[8]/255)
axarr[1,0].set_title('K=8', fontsize=20, weight='bold')
axarr[1,1].imshow(Compressed_Image_M[16]/255)
axarr[1,1].set_title('K=16', fontsize=20, weight='bold')
####for k-means running on list of K=[2,4,8,16], 'Food.JPG' iamge
Compressed_Image_F={}
for k in [2,4,8,16]:
Compressed_Image_F[k]=KMeans_on_image ("Food.JPG", k)
#plotting the original and compressed images, 'Food.JPG'
#import matplotlib.pyplot as plt
intro_fig, ax = plt.subplots(1,1, figsize=(8,4))
ax.imshow(plt.imread('Food.JPG'))
ax.set_title('Original\n', fontsize=20, weight='bold')
f, axarr = plt.subplots(2,2,figsize=(17.5,8.75))
axarr[0,0].imshow(Compressed_Image_F[2]/255)
axarr[0,0].set_title('K=2', fontsize=20, weight='bold')
axarr[0,1].imshow(Compressed_Image_F[4]/255)
axarr[0,1].set_title('K=4', fontsize=20, weight='bold')
axarr[1,0].imshow(Compressed_Image_F[8]/255)
axarr[1,0].set_title('K=8', fontsize=20, weight='bold')
axarr[1,1].imshow(Compressed_Image_F[16]/255)
axarr[1,1].set_title('K=16', fontsize=20, weight='bold')
We should note the above numbers will certinly vary, since each run use random inital points for the cetroids assignemnts, and that affect some variation in between of each run results. is also dependent on the inital results as k-mean in non-convex and we get the local minimal not the global one.
The k-mean algorithm is actually a non convex optimization, and different initialization centroids, out of all the possible potential assignemnts, determine cluster assignments and consequently cetroids for upcoming runs. In order to have examined all the posiibilites of the ultimate answers, we need to run the alogorithm for $k^m$, for $k$ cluster numbers and $m$ datapoints. That's the way to have examined all the Local minimum and finding our the true Global minimum. The following picture is from the course material:

To better illustrate this I ran the algorithm on the football image, for 10 times and with 32 clusters and reported the compressed images and corresponding stats:
####for k-means running on list of K=[32],10 times, football.bmp
Compressed_Image_F_32={}
for i in range(10):
Compressed_Image_F_32[i]=KMeans_on_image ("football.bmp", 32)
As we can see there is wide range of differences between the runs.
f, axarr = plt.subplots(5,2,figsize=(15,30))
axarr[0,0].imshow(Compressed_Image_F_32[0]/255)
axarr[0,0].set_title('1st Run (K=32)', fontsize=20, weight='bold')
axarr[0,1].imshow(Compressed_Image_F_32[1]/255)
axarr[0,1].set_title('2nd Run (K=32)', fontsize=20, weight='bold')
axarr[1,0].imshow(Compressed_Image_F_32[2]/255)
axarr[1,0].set_title('3rd Run (K=32)', fontsize=20, weight='bold')
axarr[1,1].imshow(Compressed_Image_F_32[3]/255)
axarr[1,1].set_title('4th Run (K=32)', fontsize=20, weight='bold')
axarr[2,0].imshow(Compressed_Image_F_32[4]/255)
axarr[2,0].set_title('5th Run (K=32)', fontsize=20, weight='bold')
axarr[2,1].imshow(Compressed_Image_F_32[5]/255)
axarr[2,1].set_title('6th Run (K=32)', fontsize=20, weight='bold')
axarr[3,0].imshow(Compressed_Image_F_32[6]/255)
axarr[3,0].set_title('7th Run (K=32)', fontsize=20, weight='bold')
axarr[3,1].imshow(Compressed_Image_F_32[7]/255)
axarr[3,1].set_title('8th Run (K=32)', fontsize=20, weight='bold')
axarr[4,0].imshow(Compressed_Image_F_32[8]/255)
axarr[4,0].set_title('9th Run (K=32)', fontsize=20, weight='bold')
axarr[4,1].imshow(Compressed_Image_F_32[9]/255)
axarr[4,1].set_title('10th Run (K=32)', fontsize=20, weight='bold')
Within the compressed images also, even though it may not be so easy to point out, but there are differences too. For an example, I have re-plotted the images from the 5th and 9th runs, below. Look at the hats for the palyers in these two images, surely there are visible difference:
f, axarr = plt.subplots(1,2,figsize=(15,30))
axarr[0].imshow(Compressed_Image_F_32[4]/255)
axarr[0].set_title('5th Run (K=32)', fontsize=20, weight='bold')
axarr[1].imshow(Compressed_Image_F_32[8]/255)
axarr[1].set_title('9th Run (K=32)', fontsize=20, weight='bold')
I have also tested higher K values for the Beach image and my observation was, avereage wise, after around K=60 I faced empty clusters more often. But the empty cluster happened for lower k values as well:
for k in [20,30,40,50]:
Compressed_Image_B[k]=KMeans_on_image ("beach.bmp", k)
for k in [52,54,56,58]:
Compressed_Image_B[k]=KMeans_on_image ("beach.bmp", k)
for k in [60,62,64,66]:
Compressed_Image_B[k]=KMeans_on_image ("beach.bmp", k)